Abstract

The Rao-Blackwell theorem is utilized to analyze and improve
the scalability of inference in large probabilistic models that
exhibit symmetries. A novel marginal density estimator is
introduced and shown both analytically and empirically to
outperform standard estimators by several orders of magnitude.
The developed theory and algorithms apply to a broad class of
probabilistic models including statistical relational models
considered not susceptible to lifted probabilistic inference.

Introduction 

Many successful applications of artificial intelligence research
are based on large probabilistic models. Examples include Markov
logic networks (Richardson and Domingos 2006), conditional random
fields (Lafferty, McCallum, and Pereira 2001) and, more recently,
deep learning architectures (Hinton, Osindero, and Teh 2006;
Bengio and LeCun 2007; Poon and Domingos 2011). Especially the
models one encounters in the statistical relational learning (SRL)
literature often have joint distributions spanning millions of
variables and features. Indeed, these models are so large that,
at first sight, inference and learning seem daunting. For many of
these models, however, scalable approximate and, to a lesser
extent, exact inference algorithms do exist. Most notably, there
has been a strong focus on lifted inference algorithms, that is,
algorithms that group indistinguishable variables and features
during inference. For an overview we refer the reader to
(Kersting 2012). Lifted algorithms facilitate efficient inference
in numerous large probabilistic models for which inference is
NP-hard in principle.

We are concerned with the estimation of marginal prob- 
abilities based on a finite number of sample points. We 
show that the feasibility of inference and learning in large 
and highly symmetric probabilistic models can be explained 
with the Rao-Blackwell theorem from the field of statistics. 
The theory and algorithms do not directly depend on the 
syntactical nature of the relational models such as arity of 
predicates and number of variables per formula but only on 
the given automorphism group of the probabilistic model, 
and are applicable to classes of probabilistic models much 
broader than the class of statistical relational models. 

Consider an experiment where a coin is flipped n times. While a
frequentist would assume the flips to be i.i.d., a Bayesian
typically makes the weaker assumption of exchangeability, namely
that the probability of an outcome sequence only depends on the
number of "heads" in the sequence and not on their order. Under
the non-i.i.d. assumption, a possible corresponding graphical
model is the fully connected graph with n nodes and high
treewidth. The actual number of parameters required to specify
the distribution, however, is only n+1, one for each sequence
with $0 \le k \le n$ "heads." Bruno de Finetti was the first to
realize that such a sequence of random variables can be
(re-)parameterized as a unique mixture of n+1 independent urn
processes (de Finetti 1938). It is this notion of a
parameterization as a mixture of urn processes that is at the
heart of our work. A direct application of de Finetti's results,
however, is often impossible since not all variables are
exchangeable in realistic probabilistic models.

Motivated by the intuition of exchangeability, we show that
arbitrary model symmetries allow us to re-parameterize the
distribution as a mixture of independent urn processes where each
urn consists of isomorphic joint assignments. Most importantly,
we develop a novel Rao-Blackwellized estimator that implicitly
estimates the fewer parameters of the simpler mixture model and,
based on these, computes the marginal densities. We identify
situations in which the application of the Rao-Blackwell
estimator is tractable. In particular, we show that the
Rao-Blackwell estimator is always linear-time computable for
single-variable marginal density estimation. By invoking the
Rao-Blackwell theorem, we show that the mean squared error of the
novel estimator is at least as small as that of the standard
estimator and strictly smaller under non-trivial symmetries of
the probabilistic model. Moreover, we prove that for estimates
based on sample points drawn from a Markov chain $\mathcal{M}$,
the bias of the Rao-Blackwell estimator is governed by the mixing
time of the quotient Markov chain whose convergence behavior is
superior to that of $\mathcal{M}$.

We present empirical results verifying that the Rao- 
Blackwell estimator always outperforms the standard esti- 
mator by up to several orders of magnitude, irrespective of 
the model structure. Indeed, we show that the results of the 
novel estimator resemble those typically observed in lifted 
inference papers. For the first time such a performance is 
shown for an SRL model with a transitivity formula. 



Background 

We review some concepts from group and estimation theory. 

Group Theory A group is an algebraic structure
$(\mathfrak{G}, \circ)$, where $\mathfrak{G}$ is a set closed
under a binary associative operation $\circ$ with an identity
element and a unique inverse for each element. We often write
$\mathfrak{G}$ rather than $(\mathfrak{G}, \circ)$. A permutation
group acting on a set $\Omega$ is a set of bijections
$g : \Omega \to \Omega$ that form a group. Let $\Omega$ be a
finite set and let $\mathfrak{G}$ be a permutation group acting
on $\Omega$. If $\alpha \in \Omega$ and $g \in \mathfrak{G}$ we
write $\alpha^g$ to denote the image of $\alpha$ under $g$. A
cycle $(\alpha_1\ \alpha_2\ \ldots\ \alpha_n)$ represents the
permutation that maps $\alpha_1$ to $\alpha_2$, $\alpha_2$ to
$\alpha_3$, ..., and $\alpha_n$ to $\alpha_1$. Every permutation
can be written as a product of disjoint cycles. A generating set
R of a group is a subset of the group's elements such that every
element of the group can be written as a product of finitely many
elements of R and their inverses.

We define a relation $\sim$ on $\Omega$ with $\alpha \sim \beta$
if and only if there is a permutation $g \in \mathfrak{G}$ such
that $\alpha^g = \beta$. The relation partitions $\Omega$ into
equivalence classes which we call orbits. We call this partition
of $\Omega$ the orbit partition induced by $\mathfrak{G}$. We use
the notation $\alpha^{\mathfrak{G}}$ to denote the orbit
$\{\alpha^g \mid g \in \mathfrak{G}\}$ containing $\alpha$. For a
permutation group $\mathfrak{G}$ acting on $\Omega$ and a
sequence $A = (\alpha_1, \ldots, \alpha_k) \in \Omega^k$ we write
$A^g$ to denote the image $(\alpha_1^g, \ldots, \alpha_k^g)$ of
$A$ under $g$. Moreover, we write $A^{\mathfrak{G}}$ to denote
the orbit of the sequence $A$.
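
As an illustration of the orbit definition, the following minimal Python sketch computes the orbit partition of a finite set from a generating set. The dict-based representation of a permutation and the name `orbit_partition` are assumptions made for this example and are not taken from the paper's GAP implementation.

```python
def orbit_partition(points, generators):
    """Partition `points` into orbits under the group generated by `generators`.

    Each generator is a dict mapping a point to its image; points that are
    missing from the dict are treated as fixed. For a finite permutation
    group it suffices to close under the generators themselves, because
    every inverse is a power of the corresponding generator.
    """
    seen = set()
    orbits = []
    for start in points:
        if start in seen:
            continue
        # Closure of `start` under repeated application of the generators.
        orbit = {start}
        frontier = [start]
        while frontier:
            alpha = frontier.pop()
            for g in generators:
                beta = g.get(alpha, alpha)
                if beta not in orbit:
                    orbit.add(beta)
                    frontier.append(beta)
        seen |= orbit
        orbits.append(frozenset(orbit))
    return orbits


# The cycle (1 2 3) acting on {1, 2, 3, 4}; the point 4 is fixed.
cycle = {1: 2, 2: 3, 3: 1}
orbits = orbit_partition([1, 2, 3, 4], [cycle])
```

The same closure applies unchanged to sequences of variables if the generators are lifted to act on tuples.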

Point Estimation Let $s_1, \ldots, s_N$ be N sample points drawn
from some distribution P. An estimator $\hat\theta_N$ of a
parameter $\theta$ is a function of $s_1, \ldots, s_N$. The bias
of an estimator is defined by
$\mathrm{bias}(\hat\theta_N) := E[\hat\theta_N - \theta]$ and the
variance by
$\mathrm{Var}(\hat\theta_N) := E[(\hat\theta_N - E(\hat\theta_N))^2]$,
where E is the expectation with respect to P, the distribution
that generated the data. We say that $\hat\theta_N$ is unbiased
if $\mathrm{bias}(\hat\theta_N) = 0$. The quality of an estimator
is often assessed with the mean squared error (MSE) defined by
$\mathrm{MSE}[\hat\theta_N] := E[(\hat\theta_N - \theta)^2] =
\mathrm{Var}(\hat\theta_N) + \mathrm{bias}(\hat\theta_N)^2$.
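
For completeness, the decomposition of the MSE into variance and squared bias can be verified in one step by adding and subtracting $E[\hat\theta_N]$. The derivation below is a standard identity written in the notation of this section; the cross term vanishes because $E[\hat\theta_N - E[\hat\theta_N]] = 0$.

```latex
\begin{aligned}
\mathrm{MSE}[\hat\theta_N]
 &= E\big[(\hat\theta_N - E[\hat\theta_N] + E[\hat\theta_N] - \theta)^2\big] \\
 &= E\big[(\hat\theta_N - E[\hat\theta_N])^2\big]
    + 2\,(E[\hat\theta_N] - \theta)\,E\big[\hat\theta_N - E[\hat\theta_N]\big]
    + (E[\hat\theta_N] - \theta)^2 \\
 &= \mathrm{Var}(\hat\theta_N) + \mathrm{bias}(\hat\theta_N)^2 .
\end{aligned}
```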

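Theorem 1 (Rao-Blackwell). Let $\hat\theta$ be an estimator with
$E[\hat\theta^2] < \infty$ and $T$ a sufficient statistic for
$\theta$, and let $\theta^* := E[\hat\theta \mid T]$. Then,
$\mathrm{MSE}[\theta^*] \le \mathrm{MSE}[\hat\theta]$. Moreover,
$\mathrm{MSE}[\theta^*] < \mathrm{MSE}[\hat\theta]$ unless
$\hat\theta$ is already a function of $T$.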

Finite Markov chains A finite Markov chain $\mathcal{M}$ defines
a random walk on elements of a finite set $\Omega$. For all
$x, y \in \Omega$, $Q(x, y)$ is the chain's probability to
transition from $x$ to $y$, and $Q^t(x, y) = Q^t_x(y)$ the
probability of being in state $y$ after $t$ steps if the chain
starts at $x$. A Markov chain is irreducible if for all
$x, y \in \Omega$ there exists a $t$ such that $Q^t(x, y) > 0$
and aperiodic if for all $x \in \Omega$,
$\gcd\{t \ge 1 \mid Q^t(x, x) > 0\} = 1$. An irreducible and
aperiodic chain converges to its unique stationary distribution
and is called ergodic.

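The total variation distance $d_{tv}$ of the Markov chain from
its stationary distribution $\pi$ at time $t$ with initial state
$x$ is defined by

$$ d_{tv}(Q^t_x, \pi) := \frac{1}{2} \sum_{y \in \Omega} \left| Q^t(x, y) - \pi(y) \right|. $$

For $\varepsilon > 0$, let $\tau_x(\varepsilon)$ denote the least
value $T$ such that $d_{tv}(Q^t_x, \pi) \le \varepsilon$ for all
$t \ge T$. The mixing time $\tau(\varepsilon)$ is defined by
$\tau(\varepsilon) = \max\{\tau_x(\varepsilon) \mid x \in \Omega\}$.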
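
The two definitions above can be made concrete with a small, dependency-free Python sketch that computes the total variation distance and the mixing time of a toy chain by repeated matrix multiplication. The function names and the two-state example chain are assumptions chosen for illustration. The sketch relies on the fact that the distance to stationarity never increases with t, so the first time the threshold is met is also the least T of the definition.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def tv_distance(p, q):
    """Total variation distance between two distributions on the same states."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def mixing_time(Q, pi, eps, t_max=10_000):
    """Least t at which every start state is within eps of pi in TV distance."""
    n = len(Q)
    # Q^0 is the identity matrix: the chain sits in its start state.
    Qt = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for t in range(t_max + 1):
        worst = max(tv_distance(Qt[x], pi) for x in range(n))
        if worst <= eps:
            return t
        Qt = mat_mul(Qt, Q)
    raise ValueError("chain did not mix within t_max steps")


# Hypothetical two-state chain with uniform stationary distribution.
Q_example = [[0.9, 0.1], [0.1, 0.9]]
pi_example = [0.5, 0.5]
tau = mixing_time(Q_example, pi_example, eps=0.1)
```

For this chain the distance from either start state is 0.5 * 0.8^t, which first drops below 0.1 at t = 8.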

Related Work 

There are numerous lifted inference algorithms such as lifted
variable elimination (Poole 2003), lifted belief propagation
(Singla and Domingos 2008; Kersting, Ahmadi, and Natarajan 2009),
first-order knowledge compilation (Van den Broeck 2011), and
lifted variational inference (Choi and Amir 2012). Probabilistic
theorem proving applied to a clustering of the relational model
was used to lift the Gibbs sampler (Venugopal and Gogate 2012).
Recent work exploits automorphism groups of probabilistic models
for more efficient probabilistic inference (Bui, Huynh, and
Riedel 2012; Niepert 2012). Orbital Markov chains (Niepert 2012)
are a class of Markov chains that implicitly operate on the orbit
partition of the assignment space and do not invoke the
Rao-Blackwell theorem.

Rao-Blackwellized (RB) estimators have been used for inference in
Bayesian networks (Doucet et al. 2000; Bidyuk and Dechter 2007)
and latent Dirichlet allocation (Teh, Newman, and Welling 2006)
with application in robotics (Stachniss, Grisetti, and Burgard
2005) and activity recognition (Bui, Venkatesh, and West 2002).
The RB theorem and estimator are important concepts in statistics
(Gelfand and Smith 1990; Casella and Robert 1996).

Symmetry-Aware Point Estimation

An automorphism group of a probabilistic model is a group whose
elements are permutations of the probabilistic model's random
variables $\mathbf{X}$ that leave the joint distribution
$P(\mathbf{X})$ invariant. There is a growing interest in
computing and utilizing automorphism groups of probabilistic
models for more efficient inference algorithms (Bui, Huynh, and
Riedel 2012; Niepert 2012). The line of research is primarily
motivated by the highly symmetric nature of statistical
relational models and provides a complementary view on lifted
probabilistic inference. Here, we will not be concerned with
deriving automorphism groups of probabilistic models but with
developing algorithms that utilize these permutation groups for
efficient marginal density estimation. Hence, we always assume a
given automorphism group $\mathfrak{G}$ of the probabilistic
model under consideration.

We begin by deriving a re-parameterization of the joint 
distribution in the presence of symmetries that generalizes 
the mixture of independent urn processes parameterization 
for finitely exchangeable variables (Diaconis and Freedman 
1980). All random variables are assumed to be discrete. 

Let $\mathbf{X} = (X_1, \ldots, X_n)$ be a finite sequence of
discrete random variables with joint distribution
$P(\mathbf{X})$, let $\mathfrak{G}$ be an automorphism group of
$\mathbf{X}$, and let $\mathcal{O}$ be an orbit partition of the
assignment space induced by $\mathfrak{G}$. Please note that for
any $\mathbf{x}, \mathbf{x}' \in O \in \mathcal{O}$ we have
$P(\mathbf{x}) = P(\mathbf{x}')$. For a subsequence
$\hat{\mathbf{X}}$ of $\mathbf{X}$ and an orbit
$O \in \mathcal{O}$ we write
$P(\hat{\mathbf{X}} = \hat{\mathbf{x}} \mid O)$ for the marginal
density $P(\hat{\mathbf{X}} = \hat{\mathbf{x}})$ conditioned on
$O$. Thus,

$$ P(\hat{\mathbf{X}} = \hat{\mathbf{x}} \mid O) = \frac{1}{|O|} \sum_{\mathbf{x} \in O} \mathbb{I}\{\mathbf{x}\langle\hat{\mathbf{X}}\rangle = \hat{\mathbf{x}}\}, $$

Figure 1: Illustration of the orbit partition of the assignment
space induced by the renaming automorphism group {(A B), ()}. A
renaming automorphism is a permutation of constants that forms an
isomorphism between two graphical models. (a) An MLN with three
formulas and the grounding for two constants A and B; (b) the
state space of a Gibbs chain with non-zero transitions indicated
by lines and without self-arcs; (c) the lumped state space of the
quotient Markov chain which has 10 instead of 16 states. The
joint distribution can be expressed as a mixture of draws from
the orbits.



where $\mathbb{I}$ is the indicator function and
$\mathbf{x}\langle\hat{\mathbf{X}}\rangle$ the assignment within
$\mathbf{x}$ to the variables in the sequence $\hat{\mathbf{X}}$.
We can now (re-)parameterize the marginal density as a mixture of
independent orbit distributions

$$ P(\hat{\mathbf{X}} = \hat{\mathbf{x}}) = \sum_{O \in \mathcal{O}} P(\hat{\mathbf{X}} = \hat{\mathbf{x}} \mid O)\, P(O), $$

where $P(O) = \sum_{\mathbf{x} \in O} P(\mathbf{X} = \mathbf{x})$.
For instance, the joint distribution of the Markov logic network
in Figure 1(a) can be parameterized as a mixture of the
distributions for the 10 orbits depicted in Figure 1(c).
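
The mixture parameterization can be checked by brute force on a small symmetric model. The sketch below uses four binary variables and the renaming automorphism (A B); the variable order and the `weight` function are hypothetical stand-ins chosen only to be invariant under the swap, they are not the formulas of the MLN in Figure 1.

```python
from itertools import product
from math import exp, isclose

# Assumed variable order: (smokes(A), smokes(B), cancer(A), cancer(B)).


def swap(x):
    """Renaming automorphism (A B) applied to a joint assignment."""
    sA, sB, cA, cB = x
    return (sB, sA, cB, cA)


def weight(x):
    """Hypothetical unnormalized weight that is invariant under `swap`."""
    sA, sB, cA, cB = x
    score = 1.5 * (sA * cA + sB * cB) + 0.8 * (sA == sB) - 0.5 * (sA + sB)
    return exp(score)


states = list(product([0, 1], repeat=4))
Z = sum(weight(x) for x in states)
P = {x: weight(x) / Z for x in states}

# Orbit partition of the assignment space under the group {(), (A B)}.
orbits = {frozenset({x, swap(x)}) for x in states}


def marginal_direct(i, v):
    """P(X_i = v) by summing the joint distribution."""
    return sum(p for x, p in P.items() if x[i] == v)


def marginal_mixture(i, v):
    """P(X_i = v) as the sum over orbits of P(X_i = v | O) * P(O)."""
    total = 0.0
    for O in orbits:
        p_O = sum(P[x] for x in O)
        cond = sum(x[i] == v for x in O) / len(O)
        total += cond * p_O
    return total
```

With two constants the swap fixes 4 of the 16 assignments, which yields the 10 orbits mentioned in the text, and both functions return the same marginal.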

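Let us first recall the standard estimator used in most sampling
approaches. After collecting N sample points
$\mathbf{s}_1, \ldots, \mathbf{s}_N$ the standard estimator for
the marginal density
$\theta := P(\hat{\mathbf{X}} = \hat{\mathbf{x}})$ is defined as

$$ \hat\theta_N := \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\{\mathbf{s}_i\langle\hat{\mathbf{X}}\rangle = \hat{\mathbf{x}}\}. \qquad (1) $$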



Now, the symmetry-aware Rao-Blackwell estimator for N sample
points $\mathbf{s}_1, \ldots, \mathbf{s}_N$ is defined as

$$ \hat\theta^{rb}_N := \frac{1}{N} \sum_{i=1}^{N} P(\hat{\mathbf{X}} = \hat{\mathbf{x}} \mid \mathbf{s}_i^{\mathfrak{G}}), \qquad (2) $$

where $\mathfrak{G}$ is the given automorphism group that induces
$\mathcal{O}$. Hence, the unbiased Rao-Blackwell estimator
integrates out the joint assignments of each orbit. We will prove
that the mean squared error of the Rao-Blackwell estimator is
less than or equal to that of the standard estimator. First,
however, we want to investigate under what conditions we can
efficiently compute the conditional density of equation (2). To
this end, we establish a connection between the orbit of the
subsequence $\hat{\mathbf{X}}$ under the automorphism group
$\mathfrak{G}$ and the orbit partition of the assignment space
induced by $\mathfrak{G}$.$^1$

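Definition 2. Let $\mathbf{X}$ be a finite sequence of random
variables with joint distribution $P(\mathbf{X})$, let
$\mathfrak{G}$ be an automorphism group of $\mathbf{X}$, let
$\hat{\mathbf{X}}$ be a subsequence of $\mathbf{X}$, let
$\mathrm{Val}(\mathbf{X})$ be the assignment space of
$\mathbf{X}$, and let $\mathbf{s} \in \mathrm{Val}(\mathbf{X})$.
The orbit Hamming weight of $\mathbf{s}$ with respect to the
marginal assignment $\hat{\mathbf{X}} = \hat{\mathbf{x}}$ is
defined by

$$ H^{\mathfrak{G}}_{\hat{\mathbf{X}}=\hat{\mathbf{x}}}(\mathbf{s}) := \sum_{A \in \hat{\mathbf{X}}^{\mathfrak{G}}} \mathbb{I}\{\mathbf{s}\langle A \rangle = \hat{\mathbf{x}}\}. $$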

Based on this definition, we state a lemma which allows 
us to compute the density of equation (2) in closed form, 
without having to enumerate all of the orbit's elements. 

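Lemma 3. Let $\mathbf{X}$ be a finite sequence of random
variables with joint distribution $P(\mathbf{X})$, let
$\mathfrak{G}$ be an automorphism group of $\mathbf{X}$, let
$\hat{\mathbf{X}}$ be a subsequence of $\mathbf{X}$, and let
$\mathbf{s} \in \mathrm{Val}(\mathbf{X})$. Then,

$$ P(\hat{\mathbf{X}} = \hat{\mathbf{x}} \mid \mathbf{s}^{\mathfrak{G}}) = \frac{H^{\mathfrak{G}}_{\hat{\mathbf{X}}=\hat{\mathbf{x}}}(\mathbf{s})}{|\hat{\mathbf{X}}^{\mathfrak{G}}|} = E\big[\hat\theta_N \mid H^{\mathfrak{G}}_{\hat{\mathbf{X}}=\hat{\mathbf{x}}}(\mathbf{s})\big]. $$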



The following example demonstrates the application of the lemma
to the special case of single-variable marginal density
estimation for the MLN in Figure 1.

Example 4. Let us assume we want to estimate the marginal density
P(smokes(A)=1) of the MLN in Figure 1(a). Since
$\mathfrak{G}$ = {(smokes(A) smokes(B))(cancer(A) cancer(B)), ()}
we have that
(smokes(A))$^{\mathfrak{G}}$ = {(smokes(A)), (smokes(B))}.
Given the sample point $\mathbf{s} = (1,0,1,0)$ we have that

$$ H^{\mathfrak{G}}_{(\mathrm{smokes(A)})=(1)}(\mathbf{s}) = 1 \quad\text{and}\quad P(\mathrm{smokes(A)}{=}1 \mid \mathbf{s}^{\mathfrak{G}}) = \frac{1}{2}. $$
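
The computation in Example 4 can be reproduced with a few lines of Python. This is a minimal sketch of Definition 2 and Lemma 3, not the paper's GAP code: the helper names are made up, and the variable order of the sample point is an assumption (the example does not spell it out; the order below is the one under which the stated result H = 1 holds).

```python
# Assumed variable order of the sample point s = (1, 0, 1, 0).
VARS = ("smokes(A)", "smokes(B)", "cancer(A)", "cancer(B)")
INDEX = {v: i for i, v in enumerate(VARS)}


def orbit_hamming_weight(s, var_orbit, x_hat):
    """Number of sequences A in the orbit of X-hat with s<A> = x_hat."""
    return sum(
        tuple(s[INDEX[v]] for v in A) == tuple(x_hat)
        for A in var_orbit
    )


def rb_conditional(s, var_orbit, x_hat):
    """P(X-hat = x_hat | orbit of s), computed as H / |orbit of X-hat|."""
    return orbit_hamming_weight(s, var_orbit, x_hat) / len(var_orbit)


s = (1, 0, 1, 0)

# Single-variable marginal P(smokes(A) = 1).
single_orbit = [("smokes(A)",), ("smokes(B)",)]
h = orbit_hamming_weight(s, single_orbit, (1,))
p = rb_conditional(s, single_orbit, (1,))

# A two-variable marginal P(smokes(A) = 1, cancer(A) = 1) works the same way.
pair_orbit = [("smokes(A)", "cancer(A)"), ("smokes(B)", "cancer(B)")]
p_pair = rb_conditional(s, pair_orbit, (1, 1))
```

Only the orbit of the variable sequence is enumerated, never the orbit of the joint assignment, which is what makes the conditional density cheap to evaluate.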

Thus, given a sample point s, the marginal density con- 
ditioned on an orbit of the assignment space is computable 
in closed form using the orbit Hamming weight of s with re- 
spect to the marginal assignment since it is a sufficient statis- 
tic for the marginal density. If the probabilistic model ex- 
hibits symmetries, then the Rao-Blackwell estimator's MSE 
is less than or equal to that of the standard estimator. 

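Theorem 5. Let $\mathbf{X}$ be a finite sequence of random
variables with joint distribution $P(\mathbf{X})$, let
$\mathfrak{G}$ be an automorphism group of $\mathbf{X}$ given by
a generating set of size R, let $\hat{\mathbf{X}}$ be a
subsequence of $\mathbf{X}$, and let
$\theta := P(\hat{\mathbf{X}} = \hat{\mathbf{x}})$ be the
marginal density to be estimated. The Rao-Blackwell estimator
$\hat\theta^{rb}_N$ has the following properties:

(a) Its worst-case time complexity is $O(R|\hat{\mathbf{X}}^{\mathfrak{G}}| + N|\hat{\mathbf{X}}^{\mathfrak{G}}|)$;

(b) $\mathrm{MSE}[\hat\theta^{rb}_N] \le \mathrm{MSE}[\hat\theta_N]$.

The inequality of (b) is strict if
$|\hat{\mathbf{X}}^{\mathfrak{G}}| > 1$ and there exists a joint
assignment $\mathbf{s}$ with non-zero density and
$0 < H^{\mathfrak{G}}_{\hat{\mathbf{X}}=\hat{\mathbf{x}}}(\mathbf{s}) < |\hat{\mathbf{X}}^{\mathfrak{G}}|$.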

1 Please note the two different types of orbit partitions
discussed here. One results from $\mathfrak{G}$ acting on the
assignment space, the other from $\mathfrak{G}$ acting on
sequences of random variables.



For single-variable density estimation the worst-case time
complexity of the Rao-Blackwell estimator is
$O(R|\mathbf{X}| + N|\mathbf{X}|)$ and, therefore, linear both in
the number of variables and the number of sample points. For most
symmetric models, the inequality of Theorem 5(b) is strict and
the Rao-Blackwell estimator outperforms the standard estimator, a
behavior we will verify empirically.

Please note that in the special case of single-variable marginal
density estimation the RB estimator is identical to the estimator
that averages the identically distributed variables located in
the same orbit. The advantages of utilizing the Rao-Blackwell
theory are (1) it directly provides conditions for which the
inequality of Theorem 5(b) is strict; (2) it generalizes the
single-variable case to marginals spanning multiple variables;
(3) it allows us to investigate the completeness of an estimator
with respect to a given automorphism group; and (4) it provides
the link to the quotient Markov chain in the MCMC setting and its
superior convergence behavior presented in the following section.
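
For the single-variable case, the variance reduction can be checked exactly on a toy distribution rather than by simulation. The sketch below assumes two exchangeable binary variables in one orbit with made-up probabilities, and compares the one-sample standard estimator with the orbit-averaging estimator using exact rational arithmetic. For N independent sample points both variances are simply divided by N.

```python
from fractions import Fraction as F

# Hypothetical exchangeable distribution over (X_A, X_B): P(0,1) = P(1,0).
P = {(0, 0): F(2, 5), (0, 1): F(1, 5), (1, 0): F(1, 5), (1, 1): F(1, 5)}


def mean_and_variance(estimator):
    """Exact mean and variance of a one-sample estimator under P."""
    mean = sum(p * estimator(s) for s, p in P.items())
    var = sum(p * (estimator(s) - mean) ** 2 for s, p in P.items())
    return mean, var


def standard(s):
    """Indicator that X_A = 1 in the sample point."""
    return F(s[0])


def rao_blackwell(s):
    """Orbit Hamming weight divided by the orbit size |{X_A, X_B}| = 2."""
    return F(s[0] + s[1], 2)


theta = sum(p for s, p in P.items() if s[0] == 1)  # true P(X_A = 1)
mean_std, var_std = mean_and_variance(standard)
mean_rb, var_rb = mean_and_variance(rao_blackwell)
```

Both estimators are unbiased for theta = 2/5, while the variance drops from 6/25 to 7/50 because the sample points (0,1) and (1,0) have non-zero density and an orbit Hamming weight strictly between 0 and 2.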

The Rao-Blackwell estimator is unbiased if the drawn 
sample points are independent. Since it is often only prac- 
tical to collect sample points from a Markov chain, the bias 
for a finite number of N points will depend on the chain's 
mixing behavior. We will show that if there are non-trivial 
model symmetries and if we are using the Rao-Blackwell 
estimator, we only need to worry about the mixing behavior 
of the Markov chain whose state space is the orbit partition. 

Symmetry-Aware MCMC 

Whenever we collect sample points from a Markov chain, the
efficiency of an estimator is influenced by (a) the mixing
behavior of the Markov chain and (b) the variance of the
estimator under the assumption that the Markov chain has reached
stationarity, that is, the asymptotic variance (Neal 2004). That
the Rao-Blackwell estimator's asymptotic variance is at least as
low as that of the standard estimator is a corollary of
Theorem 5. We show that the same is true for the bias that is
caused by the fact that we collect a finite number of sample
points from Markov chains which never exactly reach stationarity.

A lumping of a Markov chain is a partition of its state space
which is possible under certain conditions on the transition
probabilities of the original Markov chain (Buchholz 1994;
Derisavi, Hermanns, and Sanders 2003).

Definition 6. Let $\mathcal{M}$ be an ergodic Markov chain with
transition matrix $Q$, stationary distribution $\pi$, and state
space $\Omega$, and let $\mathcal{C} = \{C_1, \ldots, C_n\}$ be a
partition of the state space. If for all
$C_i, C_j \in \mathcal{C}$ and all $s_i', s_i'' \in C_i$

$$ Q'(C_i, C_j) := \sum_{s_j \in C_j} Q(s_i', s_j) = \sum_{s_j \in C_j} Q(s_i'', s_j), $$

then we say that $\mathcal{M}$ is ordinary lumpable with respect
to $\mathcal{C}$. If in addition, $\pi(s_i') = \pi(s_i'')$ for
all $s_i', s_i'' \in C_i$ and all $C_i \in \mathcal{C}$ then
$\mathcal{M}$ is exactly lumpable with respect to $\mathcal{C}$.
The Markov chain $\mathcal{M}'$ with state space $\mathcal{C}$
and transition matrix $Q'$ is called the quotient chain of
$\mathcal{M}$ with respect to $\mathcal{C}$.

Every finite ergodic Markov chain is exactly lumpable with
respect to an orbit partition of its state space. The following
proposition states this and the convergence behavior of the
quotient Markov chain in relation to the original Markov chain
(cf. (Boyd et al. 2005)).


Proposition 7. Let $\mathcal{M}$ be an ergodic Markov chain and
let $\mathcal{O}$ be an orbit partition of its state space. Then,
the Markov chain $\mathcal{M}$ is exactly lumpable with respect
to $\mathcal{O}$. If $\mathcal{M}$ is reversible, then the
quotient Markov chain $\mathcal{M}'$ with respect to
$\mathcal{O}$ is also reversible. Moreover, the mixing time of
$\mathcal{M}'$ is smaller than or equal to the mixing time of
$\mathcal{M}$.
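
The lumping condition of Definition 6 is easy to test numerically for a small chain. The following sketch checks ordinary lumpability of a transition matrix with respect to a partition and, if the check passes, returns the transition matrix of the quotient chain. The function name and the three-state example chain are assumptions made for illustration.

```python
from math import isclose


def quotient_chain(Q, partition):
    """Return the quotient transition matrix Q', or None if not lumpable.

    `Q` is a list of rows and `partition` a list of blocks of state indices.
    The chain is ordinary lumpable if, for every pair of blocks (Ci, Cj),
    all states in Ci have the same total probability of moving into Cj.
    """
    Q_prime = []
    for Ci in partition:
        row = []
        for Cj in partition:
            mass = [sum(Q[s][t] for t in Cj) for s in Ci]
            if not all(isclose(m, mass[0], abs_tol=1e-12) for m in mass):
                return None
            row.append(mass[0])
        Q_prime.append(row)
    return Q_prime


# Hypothetical chain in which states 1 and 2 are interchangeable.
Q_example = [
    [0.5, 0.25, 0.25],
    [0.3, 0.5, 0.2],
    [0.3, 0.2, 0.5],
]
Q_lumped = quotient_chain(Q_example, [[0], [1, 2]])
Q_not_lumpable = quotient_chain(Q_example, [[0, 1], [2]])
```

Swapping states 1 and 2 leaves `Q_example` unchanged, so the orbit partition {{0}, {1, 2}} passes the check, while the arbitrary partition {{0, 1}, {2}} does not.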

Example 8. Figure 1(b) depicts the state space of the Gibbs chain
for the MLN shown in Figure 1(a). The constants renaming
automorphism group {(A B), ()} acting on the sets of constants
leads to the automorphism group {(smokes(A) smokes(B))(cancer(A)
cancer(B)), ()} on the ground level. This permutation group
acting on the state space of the Gibbs chain induces an orbit
partition which is the state space of the quotient Markov chain
(see Figure 1(c)).

The explicit construction of the state space of a quotient Markov
chain is intractable. Given an automorphism group $\mathfrak{G}$,
merely counting the number of equivalence classes of the orbit
partition of the assignment space induced by $\mathfrak{G}$ is
known to be a #P-complete problem (Goldberg 2001). Nevertheless,
if the Rao-Blackwell estimator is utilized, one can draw the
sample points from the original Markov chain while analyzing the
convergence behavior of the quotient Markov chain of the original
chain.

Theorem 9. Let $\mathbf{X}$ be a finite sequence of random
variables with joint distribution $P(\mathbf{X})$, let
$\mathcal{M}$ be an ergodic Markov chain with stationary
distribution $P$, and let $\mathcal{O}$ be an orbit partition of
$\mathcal{M}$'s state space. Let $\hat\theta^{rb}_N$ be the
Rao-Blackwell estimator for N sample points
$\mathbf{s}_{T+1}, \ldots, \mathbf{s}_{T+N}$ collected from
$\mathcal{M}$, after discarding the first T sample points. Then,
$|\mathrm{bias}(\hat\theta^{rb}_N)| \le \varepsilon$ if
$T \ge \tau'(\varepsilon)$, where $\tau'(\varepsilon)$ is the
mixing time of the quotient Markov chain of $\mathcal{M}$ with
respect to $\mathcal{O}$.

Hence, if one wants to make sure that the absolute value of the
bias of the Rao-Blackwell estimator is smaller than a given
$\varepsilon > 0$, one only needs a burn-in period consisting of
$\tau'(\varepsilon)$ simulation steps, where $\tau'(\varepsilon)$
is the mixing time of the quotient Markov chain. Existing work on
analyzing the influence of symmetries in random walks has shown
that it is often more convenient to investigate the mixing
behavior of the quotient Markov chain (Boyd et al. 2005). In the
context of marginal density estimation, Markov chains implicitly
operating on the orbit partition of the assignment space were
shown to have better mixing behavior (Niepert 2012).

In summary, whenever probabilistic models exhibit non- 
trivial symmetries we can have the best of both worlds. The 
bias owed to the fact that we are collecting a finite number of 
sample points from a Markov chain as well as the asymptotic 
variance (Neal 2004) of the Rao-Blackwell estimator are at 
least as small as those of the standard estimator. The more 
symmetric the probabilistic model the larger the reduction in 
mean squared error. 

We now present the experimental results for several large 
probabilistic models, both relational and non-relational. 

(d) The smokes-cancer MLN with 50 people, 10% evidence, and transitivity.
(e) The 2-coloring 100 x 100 grid model with weight 0.2.
(f) The 2-coloring 100 x 100 grid model with hard constraints.

Figure 2: Plots of average KL divergence versus time in seconds of the two MCMC algorithms with the standard estimator
(standard) and the Rao-Blackwell estimator (aggregated) for various probabilistic models.



Experiments 



The aim of the empirical investigation is twofold. First, we 
want to verify the efficiency of the novel Rao-Blackwell es- 
timator when applied as a post-processing step to the output 
of state-of-the-art sampling algorithms. Second, we want to 
test the hypothesis that the efficiency gains of the novel es- 
timator on standard SRL models are similar empirically to 
those of state-of-the-art lifted inference algorithms. 

For the SRL models we computed the orbit partitions of the
variables based on the model's renaming automorphisms (Bui,
Huynh, and Riedel 2012). As discussed earlier, renaming
automorphisms are computable in time linear in the domain size.
We applied GAP (GAP 2012) to compute the variables' orbit
partition. For all non-SRL models we computed the automorphism
group and the orbit partitions as in (Niepert 2012) using the
graph automorphism algorithm SAUCY (Darga, Sakallah, and Markov
2008) and the GAP system, respectively. Overall, the computation
of the orbit partitions of the models' variables took less than
one second for each of the probabilistic models we considered.

We conducted experiments with several benchmark Markov logic
networks, a statistical relational language general enough to
capture numerous types of graphical models (Richardson and
Domingos 2006). Here, we used (a) the asthma-smokes-cancer MLN
(Venugopal and Gogate 2012) with 10% evidence;$^2$ (b) the
"Friends & Smokers" MLN exactly as specified in (Singla and
Domingos 2008) with 10% evidence; and the "Friends & Smokers" MLN
with the transitivity formula on the friends relation having
weight 1.0, (c) without and (d) with 10% evidence. Each of the
models had between 10 and 100 objects in the domain, leading to
log-linear models with $10^2$-$10^4$ variables and $10^2$-$10^6$
features. We used WFOMC (Van den Broeck 2011) to compute the
exact single-variable marginals of the asthma MLN. For all other
MLNs, existing exact lifted inference algorithms were unable to
compute single-variable densities. In these cases, we performed
several very long runs (burn-in 1 day; overall 5 days) of a Gibbs
sampler guaranteed to be ergodic and made sure that
state-of-the-art MCMC diagnostics indicated convergence (Brooks
and Gelman 1998).

2 For a random 10% of all people it is known (a) whether they
smoke or not and (b) who 10 of their friends are.
3 https://code.google.com/p/lifted-mcmc/

Figure 3: Average time and number of sample points, respectively, needed to achieve an average KL divergence of < 0.0001.

We executed our implementation of the standard Gibbs sampler and
ALCHEMY's implementation of the MC-SAT algorithm (Poon and
Domingos 2006) on the MLNs based on 10 separate runs, without a
burn-in period. For each sampling algorithm we computed the
marginal densities with the standard estimator and the
Rao-Blackwell estimator, respectively, which we implemented in
the GAP programming language.$^3$ Figure 2 depicts, for each MLN,
the average Kullback-Leibler divergence$^4$ between the estimated
and precomputed true single-variable marginals of the
non-evidence variables plotted against the absolute running time
of the algorithms in seconds.
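
The error measure used in the plots can be sketched as follows for binary variables. The text does not state the direction of the divergence or how zero estimates are handled, so both choices below (divergence from the true marginal to the estimate, and clamping estimates away from 0 and 1) are assumptions, as are the function names.

```python
from math import log


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two Bernoulli distributions.

    `p` is the true marginal P(X = 1) and `q` the estimate. The estimate is
    clamped to [eps, 1 - eps] so that an estimate of exactly 0 or 1 does not
    produce an infinite divergence.
    """
    q = min(max(q, eps), 1.0 - eps)
    total = 0.0
    if p > 0.0:
        total += p * log(p / q)
    if p < 1.0:
        total += (1.0 - p) * log((1.0 - p) / (1.0 - q))
    return total


def average_kl(true_marginals, estimates):
    """Mean KL divergence over all non-evidence variables."""
    pairs = list(zip(true_marginals, estimates))
    return sum(bernoulli_kl(p, q) for p, q in pairs) / len(pairs)
```

A call such as `average_kl([0.4, 0.4], [0.38, 0.45])` then returns a single number per point in time, which is the quantity that would be plotted against the running time.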

The Rao-Blackwell estimator improves the density esti- 
mates by at least an order of magnitude and, in the absence 
of evidence, even up to four orders of magnitude relative 
to the standard estimator. The improvement of the empir- 
ical results is independent of the relational structure of the 
MLNs. For the MLN with a transitivity formula on the 
friends relation, generally considered a problematic and 
as of now not domain-liftable model, the results are as pro- 
nounced as for the MLNs known to be domain-liftable. 

We also conducted experiments with non-SRL models to investigate
the efficiency of the approach on graphical models. We executed
the Gibbs sampler with and without using the Rao-Blackwell
estimator on a 100 x 100 2-coloring grid model with binary random
variables. The symmetries of the model are the reflection and
rotation automorphisms of the 2-dimensional square grid.
Figure 2(e) depicts the plot of the average KL divergence against
the running time in seconds, where each pairwise factor between
neighboring variables X, Y was defined as exp(0.2) if
$X \neq Y$, and 1 otherwise. Figure 2(f) depicts the plot of the
same grid model except that the pairwise factors were defined as
1 if $X \neq Y$, and 0 otherwise. The results clearly demonstrate
the superior performance of the Rao-Blackwell estimator even for
probabilistic models with a smaller number of symmetries.
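
To see why the grid model has comparatively few symmetries, one can enumerate the orbits of the grid cells under rotations and reflections. The sketch below is an illustration written for this purpose, not the SAUCY/GAP pipeline used in the experiments. It generates the symmetry group of the square from a 90-degree rotation and a mirror reflection; the function name is hypothetical.

```python
def grid_orbits(n):
    """Orbits of the cells of an n x n grid under rotations and reflections."""

    def rotate(cell):
        # Rotation of the grid by 90 degrees.
        i, j = cell
        return (j, n - 1 - i)

    def reflect(cell):
        # Mirror image along the vertical axis.
        i, j = cell
        return (i, n - 1 - j)

    seen = set()
    orbits = []
    for cell in ((i, j) for i in range(n) for j in range(n)):
        if cell in seen:
            continue
        # Close the cell under the two generators of the symmetry group.
        orbit = {cell}
        frontier = [cell]
        while frontier:
            current = frontier.pop()
            for g in (rotate, reflect):
                image = g(current)
                if image not in orbit:
                    orbit.add(image)
                    frontier.append(image)
        seen |= orbit
        orbits.append(orbit)
    return orbits


orbits_100 = grid_orbits(100)
largest_orbit = max(len(o) for o in orbits_100)
```

For the 100 x 100 grid this gives 1275 orbits of at most 8 cells each, so a single-variable Rao-Blackwell estimate averages over at most 8 variables, far fewer than in the relational models.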

4 We computed both MSE and average KL divergence but omitted the
qualitatively identical MSE results due to space constraints.

In addition, we analyzed the impact of the domain size on the
estimator performance for (a) domain-liftable MLNs and (b) MLNs
not liftable by any state-of-the-art exact lifted inference
algorithm. We used the "Friends & Smokers" MLN without evidence,
with and without the transitivity formula on the friends
relation. The MLN without transitivity is a standard benchmark
for lifted algorithms whereas MLNs with transitivity are
considered difficult and no exact lifted inference algorithm
exists for such MLNs as of now. Figures 3(a) and 3(c) depict the
time needed to achieve an average KL divergence of less than
$10^{-4}$ plotted against the domain size of the models without
and with transitivity. The increase in runtime is far less
pronounced with the Rao-Blackwell estimator. The plots resemble
those often shown in lifted inference papers where an algorithm
that can lift a model is contrasted with one that cannot. The
increase in runtime is slightly higher for the model with
transitivity but this is owed to the size increase of each
variable's Markov blanket and, thus, the time needed for each
Gibbs sampler step. Figures 3(b) and 3(d) plot the sample size
required to achieve an average KL divergence of less than
$10^{-4}$ against the domain size. Interestingly, the number of
sample points is almost identical for the model with and without
transitivity, indicating that the advantage of the Rao-Blackwell
estimator is independent of the model's formulas.

In Figure 3(a) we plot the results of WFOMC for compiling a
first-order circuit and computing (a) one single-variable
marginal and (b) all single-variable marginals. WFOMC has
constant runtime for exactly computing one single-variable
marginal density. The Rao-Blackwell estimation for all of the
model's variables scales sub-linearly and is more efficient than
repeated calls to WFOMC. While we do not need to run WFOMC once
per single-variable marginal density if the variables are first
partitioned into sets of variables with identical marginal
densities (de Salvo Braz, Amir, and Roth 2005), the results
demonstrate that the symmetry-aware estimator scales comparably
to exact lifted inference algorithms on domain-liftable models
and that its runtime is polynomial in the domain size of the
MLNs.

Discussion 

A Rao-Blackwell estimator was developed and shown, both 
analytically and empirically, to have lower mean squared 
error under non-trivial model symmetries. The presented 
theory provides a novel perspective on the notion of lifted 
inference and the underlying reasons for the feasibility of 
marginal density estimation in large but highly symmetric 
probabilistic models. For the first time, the applicability of 
such an approach does not directly depend on the proper- 
ties of the relational structure such as the arity of predicates 
and the type of formulas but only on the given evidence and 
the corresponding automorphism group of the model. We 
believe the theoretical and empirical insights to be of great 
interest to the machine learning community and that the pre- 
sented work might contribute to a deeper understanding of 
lifted inference algorithms. 
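
Proof of Lemma 3

Lemma. Let $\mathbf{X}$ be a finite sequence of random variables
with joint distribution $P(\mathbf{X})$, let $\mathfrak{G}$ be an
automorphism group of $\mathbf{X}$, let $\hat{\mathbf{X}}$ be a
subsequence of $\mathbf{X}$, let $\hat{\mathbf{X}}^{\mathfrak{G}}$
be the orbit of $\hat{\mathbf{X}}$, and let
$\mathbf{s} \in \mathrm{Val}(\mathbf{X})$. Then,

$$ P(\hat{\mathbf{X}} = \hat{\mathbf{x}} \mid \mathbf{s}^{\mathfrak{G}}) = \frac{H^{\mathfrak{G}}_{\hat{\mathbf{X}}=\hat{\mathbf{x}}}(\mathbf{s})}{|\hat{\mathbf{X}}^{\mathfrak{G}}|} = E\big[\hat\theta_N \mid H^{\mathfrak{G}}_{\hat{\mathbf{X}}=\hat{\mathbf{x}}}(\mathbf{s})\big]. $$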



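Proof. Let
$\mathfrak{G}_{\mathbf{s}} := \{g \in \mathfrak{G} \mid \mathbf{s}^g = \mathbf{s}\}$
be the stabilizer subgroup of $\mathbf{s}$. Then,

$$ P(\hat{\mathbf{X}} = \hat{\mathbf{x}} \mid \mathbf{s}^{\mathfrak{G}}) = \frac{|\{g \in \mathfrak{G} \mid \mathbf{s}^g\langle\hat{\mathbf{X}}\rangle = \hat{\mathbf{x}}\}|}{|\mathfrak{G}|} $$

since for each $\mathbf{x} \in \mathbf{s}^{\mathfrak{G}}$ we have
that
$|\{g \in \mathfrak{G} \mid \mathbf{s}^g = \mathbf{x}\}| = |\mathfrak{G}_{\mathbf{s}}|$
by the orbit stabilizer theorem. For each
$A \in \hat{\mathbf{X}}^{\mathfrak{G}}$ let
$\mathfrak{G}_A := \{g \in \mathfrak{G} \mid A^g = \hat{\mathbf{X}}\}$.
Again, by the orbit stabilizer theorem, we have that
$|\mathfrak{G}_A| = |\mathfrak{G}_{\hat{\mathbf{X}}}|$ for each
$A \in \hat{\mathbf{X}}^{\mathfrak{G}}$, where
$\mathfrak{G}_{\hat{\mathbf{X}}}$ is the stabilizer subgroup of
$\hat{\mathbf{X}}$. Hence,

$$ \frac{|\{g \in \mathfrak{G} \mid \mathbf{s}^g\langle\hat{\mathbf{X}}\rangle = \hat{\mathbf{x}}\}|}{|\mathfrak{G}|} = \frac{H^{\mathfrak{G}}_{\hat{\mathbf{X}}=\hat{\mathbf{x}}}(\mathbf{s})\,|\mathfrak{G}_{\hat{\mathbf{X}}}|}{|\mathfrak{G}|} = \frac{H^{\mathfrak{G}}_{\hat{\mathbf{X}}=\hat{\mathbf{x}}}(\mathbf{s})}{|\hat{\mathbf{X}}^{\mathfrak{G}}|}. $$

Hence, $H^{\mathfrak{G}}_{\hat{\mathbf{X}}=\hat{\mathbf{x}}}$ is
a sufficient statistic for the marginal density
$P(\hat{\mathbf{X}} = \hat{\mathbf{x}})$. Moreover, we have that

$$ P(\hat{\mathbf{X}} = \hat{\mathbf{x}} \mid \mathbf{s}^{\mathfrak{G}}) = E\big[\hat\theta_N \mid H^{\mathfrak{G}}_{\hat{\mathbf{X}}=\hat{\mathbf{x}}}(\mathbf{s})\big]. $$

This concludes the proof. □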



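Proof of Theorem 5

Theorem. Let $\mathbf{X}$ be a finite sequence of random
variables with joint distribution $P(\mathbf{X})$, let
$\mathfrak{G}$ be an automorphism group of $\mathbf{X}$ given by
R generators, let $\hat{\mathbf{X}}$ be a subsequence of
$\mathbf{X}$, let $\hat{\mathbf{X}}^{\mathfrak{G}}$ be the orbit
of $\hat{\mathbf{X}}$, and let
$\theta := P(\hat{\mathbf{X}} = \hat{\mathbf{x}})$ be the
marginal density to be estimated. The Rao-Blackwell estimator
$\hat\theta^{rb}_N$ has the following properties:

(a) Its worst-case time complexity is $O(R|\hat{\mathbf{X}}^{\mathfrak{G}}| + N|\hat{\mathbf{X}}^{\mathfrak{G}}|)$;

(b) $\mathrm{MSE}[\hat\theta^{rb}_N] \le \mathrm{MSE}[\hat\theta_N]$.

The inequality of (b) is strict if
$|\hat{\mathbf{X}}^{\mathfrak{G}}| > 1$ and there exists a joint
assignment $\mathbf{s}$ with non-zero density and
$0 < H^{\mathfrak{G}}_{\hat{\mathbf{X}}=\hat{\mathbf{x}}}(\mathbf{s}) < |\hat{\mathbf{X}}^{\mathfrak{G}}|$.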



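Proof. We first construct the set
$\hat{\mathbf{X}}^{\mathfrak{G}}$ once, which has a worst-case
time complexity of $O(R|\hat{\mathbf{X}}^{\mathfrak{G}}|)$ (Holt,
Eick, and O'Brien 2005). For each sample point, we have to access
an array representing the values of the sample point at most
$|\hat{\mathbf{X}}^{\mathfrak{G}}|$ times. This allows us, for
each sample point $\mathbf{s}$, to compute
$P(\hat{\mathbf{X}} = \hat{\mathbf{x}} \mid \mathbf{s}^{\mathfrak{G}})$
in time $O(|\hat{\mathbf{X}}^{\mathfrak{G}}|)$ by Lemma 3. Since
$H^{\mathfrak{G}}_{\hat{\mathbf{X}}=\hat{\mathbf{x}}}$ is a
sufficient statistic for $\theta$ and
$\hat\theta^{rb}_N = E[\hat\theta_N \mid H^{\mathfrak{G}}_{\hat{\mathbf{X}}=\hat{\mathbf{x}}}]$
by Lemma 3, statement (b) follows from the Rao-Blackwell theorem
(Blackwell 1947). If $|\hat{\mathbf{X}}^{\mathfrak{G}}| > 1$ and
there exists a joint assignment $\mathbf{s}$ with non-zero
density and
$0 < H^{\mathfrak{G}}_{\hat{\mathbf{X}}=\hat{\mathbf{x}}}(\mathbf{s}) < |\hat{\mathbf{X}}^{\mathfrak{G}}|$,
then $\hat\theta_N$ is not a function of
$H^{\mathfrak{G}}_{\hat{\mathbf{X}}=\hat{\mathbf{x}}}$ and the
inequality is strict (Blackwell 1947). □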



Proof of Theorem 9

Theorem. Let $\mathbf{X}$ be a finite sequence of random
variables with joint distribution $P(\mathbf{X})$, let
$\mathcal{M}$ be an ergodic Markov chain with stationary
distribution $P$, and let $\mathcal{O}$ be an orbit partition of
$\mathcal{M}$'s state space. Let $\hat\theta^{rb}_N$ be the
Rao-Blackwell estimator for N sample points
$\mathbf{s}_{T+1}, \ldots, \mathbf{s}_{T+N}$ collected from
$\mathcal{M}$, after discarding the first T sample points. Then,
$|\mathrm{bias}(\hat\theta^{rb}_N)| \le \varepsilon$ if
$T \ge \tau'(\varepsilon)$, where $\tau'(\varepsilon)$ is the
mixing time of the quotient Markov chain of $\mathcal{M}$ with
respect to $\mathcal{O}$.

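Proof. For a subsequence $\hat{\mathbf{X}}$ of $\mathbf{X}$, let
$\xi := \hat{\mathbf{X}} = \hat{\mathbf{x}}$ be the marginal
assignment whose density $\theta$ is to be estimated, let
$\mathrm{Val}(\mathbf{X})$ be the assignment space of
$\mathbf{X}$, and let
$S = \{\mathbf{s}_{T+1}, \ldots, \mathbf{s}_{T+N}\}$ be the
multiset of sample points collected from $\mathcal{M}$, after
discarding the first T sample points. Since $\mathcal{O}$ is a
partition of the assignment space, we have that

$$ \hat\theta^{rb}_N = \frac{1}{N} \sum_{\mathbf{s} \in S} P(\xi \mid \mathbf{s}^{\mathfrak{G}}) = \sum_{O \in \mathcal{O}} P(\xi \mid O)\, \frac{1}{N} \sum_{\mathbf{s} \in S} \mathbb{I}\{\mathbf{s} \in O\}. $$

Hence,

$$ E[\hat\theta^{rb}_N] = \sum_{O \in \mathcal{O}} P(\xi \mid O)\, E[\mathbb{I}\{\mathbf{s} \in O\}], $$

where $E[\mathbb{I}\{\mathbf{s} \in O\}]$ is the expectation of
some sample point being located in the orbit $O$.
$E[\mathbb{I}\{\mathbf{s} \in O\}]$ defines a probability
distribution over the space $\mathcal{O}$. If the sample points
are independent, then $E[\mathbb{I}\{\mathbf{s} \in O\}] = P(O)$,
for all $O \in \mathcal{O}$, and the estimator is unbiased. Since
we collect sample points from a Markov chain we will often have
that $E[\mathbb{I}\{\mathbf{s} \in O\}] \neq P(O)$. By the
assumptions and Proposition 7 the Markov chain $\mathcal{M}$ is
exactly lumpable with respect to $\mathcal{O}$ and, hence, for
all states $\mathbf{x}$ of $\mathcal{M}$, all
$t \in \{1, 2, \ldots\}$, and all orbits $O \in \mathcal{O}$, we
have that
$\sum_{\mathbf{o} \in O} Q^t(\mathbf{x}, \mathbf{o}) = Q'^t(\mathbf{x}^{\mathfrak{G}}, O)$,
where $Q^t(\mathbf{x}, \mathbf{o})$ is the probability of the
Markov chain $\mathcal{M}$ being in state $\mathbf{o}$ after $t$
simulation steps if the chain starts in state $\mathbf{x}$. In
addition, we start collecting sample points after
$T \ge \tau'(\varepsilon)$ simulation steps and, thus,

$$ \frac{1}{2} \sum_{O \in \mathcal{O}} \big| E[\mathbb{I}\{\mathbf{s} \in O\}] - P(O) \big| \le \max_{\mathbf{x}} \frac{1}{2} \sum_{O \in \mathcal{O}} \big| Q'^T(\mathbf{x}^{\mathfrak{G}}, O) - P(O) \big| \le \varepsilon. $$

Finally,

$$ \begin{aligned}
|\mathrm{bias}(\hat\theta^{rb}_N)| = |E[\hat\theta^{rb}_N - \theta]|
 &= \Big| \sum_{O \in \mathcal{O}} P(\xi \mid O)\, E[\mathbb{I}\{\mathbf{s} \in O\}] - \sum_{O \in \mathcal{O}} P(\xi \mid O)\, P(O) \Big| \\
 &= \Big| \sum_{O \in \mathcal{O}} P(\xi \mid O)\, \big( E[\mathbb{I}\{\mathbf{s} \in O\}] - P(O) \big) \Big| \\
 &\le \sum_{\substack{O \in \mathcal{O} \\ E[\mathbb{I}\{\mathbf{s} \in O\}] > P(O)}} \big( E[\mathbb{I}\{\mathbf{s} \in O\}] - P(O) \big) \\
 &= \frac{1}{2} \sum_{O \in \mathcal{O}} \big| E[\mathbb{I}\{\mathbf{s} \in O\}] - P(O) \big| \le \varepsilon.
\end{aligned} $$

The last equality follows from a known identity of the total
variation distance (Levin, Peres, and Wilmer 2008). □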



